Welcome to our Natural Language Processing tutorial!
High-level overview:
Create an environment to use R AND Python in the SAME RMarkdown!
Access the Genius API to easily create a custom dataset
Overview of Natural Language Processing
Text Pre-processing and formatting
Sentiment analysis
Topic modeling (bag of words): Latent Dirichlet Allocation
BERTopic advanced models
#install.packages('reticulate')
#install.packages('dotenv')
library("reticulate") # Incorporates python code
#install_miniconda()
library('dotenv') # Uses .env files to hide sensitive information; i.e. access codes
More info about using .env files: https://medium.com/towards-data-science/using-dotenv-to-hide-sensitive-information-in-r-8b878fa72020
For the first part of this tutorial we use the reticulate package to set up a special environment into which we can install Python packages. This allows us to use both R AND Python code within the same RMarkdown document.
It’s also possible to incorporate python classes into RMarkdown: http://theautomatic.net/2020/01/14/how-to-import-python-classes-into-r/
The following descriptions are from the reticulate github repo: https://rstudio.github.io/reticulate/
Create a conda environment to load python packages
Disclaimer: The code below using reticulate ‘should’ run without much issue on a Windows PC, but macOS and Linux may have unforeseen difficulties. Check dependencies, etc.; good luck!
# Initial installation code to the environment
# This chunk should take a couple minutes to install
# conda_create('r-reticulate')
#
# # Use pip=T for non-conda packages
# conda_install('r-reticulate',"scikit-learn")
# conda_install('r-reticulate',"lyricsgenius", pip=T)
# conda_install('r-reticulate',"contractions", pip=T)
# conda_install('r-reticulate',"nltk")
# conda_install('r-reticulate',"numpy")
# conda_install('r-reticulate',"pandas")
# conda_install('r-reticulate','gensim')
# conda_install('r-reticulate','python-flair')
# conda_install('r-reticulate', 'BERTopic')
# conda_install('r-reticulate', 'plotly')
Note: Install the package as scikit-learn, but import it as sklearn
use_condaenv("r-reticulate") # Loads a pre-existing conda environment
#Imports python packages into the environment
import('sklearn') # Comprehensive machine learning toolkit
## Module(sklearn)
import('lyricsgenius') # access the Genius API
## Module(lyricsgenius)
import("nltk") # Natural language toolkit
## Module(nltk)
import('contractions') # expands contractions
## Module(contractions)
import('gensim') # Topic modeling for language
## Module(gensim)
import('flair') # State-of-the-Art NLP techniques
## Module(flair)
import('bertopic') # Advanced topic modeling
## Module(bertopic)
import('numpy') # duh
## Module(numpy)
import('pandas') # duh
## Module(pandas)
import('plotly')
## Module(plotly)
Using the lyricsgenius package we can easily access the Genius API, and with the couple of functions provided below we can create datasets from the Genius top charts with one line of code.
Note: Additional functions and instructions provided at the end of the tutorial
Some setup code
path <- getwd() # current working directory
load_dot_env("tokens.env") # Access hidden .env file
client_access_token <- Sys.getenv("client_access_token") # get access code
Note: If you get “Warning: incomplete final line found on ‘tokens.env’”, try hitting enter at the end of your .env file
Switching to python code
# ^^^ observe
import sklearn
import lyricsgenius
import nltk
import pandas as pd
import numpy as np
# Able to directly load base python packages
import os
genius = lyricsgenius.Genius(r.client_access_token) # Genius API agent
def top_charts(time_period='all_time',genre='all',
n_per_page=50,type_='songs',pages=1):
# Purpose: Access the top charts Genius API to create a dataset
#  - Results vary, but size is less than 200
#  - Used in conjunction with the song_info function
# Input:
#  - time_period: 'day', 'week', 'month' or 'all_time'
#  - genre: 'all', 'rap', 'pop', 'rb', 'rock' or 'country'
#  - n_per_page: results per page, 1 - 50
#  - type_: item type: 'songs', 'albums', or 'artists'
#  - pages: number of pages to request
# Output:
#  - dataframe of song ids, titles, artists, and lyrics
song_ids = list() # Lists to add to output data frame
# while-try loop because the request sometimes times out and will kill the loop
for pn in range(1,pages+1):
t = True
while t == True:
try:
songs = genius.charts(page=pn,time_period=time_period,
chart_genre=genre,per_page=n_per_page,type_=type_)
except:
pass
else:
t = False
n = len(songs['chart_items']) # number of hits
# get song ids
for song in range(0,n):
language = songs['chart_items'][song]['item']['language']
if language == 'en':
song_id = songs['chart_items'][song]['item']['api_path'].replace('/songs/','')
song_ids.append(song_id)
# call song_info function to retrieve lyrics
topchart_df = song_info(song_ids)
# option to save the dataframe as a csv file
path = os.getcwd()
csv_name = 'topchart_'+time_period+'_'+type_+'_'+genre+'.csv'
#topchart_df.to_csv(os.path.join(path, csv_name), index=False)
return topchart_df
def song_info(ids,time_period='all_time',genre='all',type_='songs'):
# Input: list of song ids
# Output: dataframe with song ids, artist names, lyrics, song titles.
if not isinstance(ids, list):
    print('input needs to be a list')
lyrics = list()
titles = list()
artists = list()
bad_song_ids = list()
song_ids = ids
# Access Genius API for each song_id
# The try/except/pass code protects the dataset creation from being
# terminated if any individual API call fails; the Genius API
# sometimes times out.
for song in song_ids:
t = True
while t == True:
try:
a = genius.search_song(song_id=song)
except:
pass
else:
t = False
if (a is not None) and (a.to_text() is not None):
lyrics.append(a.to_text())
titles.append(a.title)
artists.append(a.artist)
else:
bad_song_ids.append(song)
# Eliminate corrupt songs
song_ids = [x for x in song_ids if x not in bad_song_ids]
# output data frame
song_df = pd.DataFrame({
'title': titles,
'lyrics': lyrics,
'artist': artists,
'song_ids': song_ids,})
return song_df
# dataset1 = top_charts(genre='country',n_per_page=50, pages=1,
# type_='songs',time_period='all_time')
Dataset options:
Note: Repeated calls to the Genius API will (rarely) cause the automated functions to terminate, even with the try/except/pass code, but it should work if you retry once or twice.
Caveat: Some songs may appear twice with different versions (remix, acoustic). We decided not to remove duplicates because duplicate versions both appearing in top charts shows just how popular the song is.
Here I load datasets previously created with the same functions above
path = os.getcwd()
path_rap = path+'/topchart_all_time_songs_rap.csv'
rap_df = pd.read_csv(path_rap, index_col=False)
path_rock = path+'/topchart_all_time_songs_rock.csv'
rock_df = pd.read_csv(path_rock, index_col=False)
path_pop = path+'/topchart_all_time_songs_pop.csv'
pop_df = pd.read_csv(path_pop, index_col=False)
path_rb = path+'/topchart_all_time_songs_rb.csv'
rb_df = pd.read_csv(path_rb, index_col=False)
path_country = path+'/topchart_all_time_songs_country.csv'
country_df = pd.read_csv(path_country, index_col=False)
dfs = [rap_df,rock_df, pop_df, rb_df, country_df]
full_df = pd.concat(dfs)
Before we can analyze our data, we must pre-process our text, as one would for any natural language processing task. The main pre-processing steps we will focus on are data cleaning, tokenization, stop-word removal, and normalization/lemmatization.
The first step is to clean our data of any unintended characters or symbols present in the text. Because our data comes from using the ‘lyricsgenius’ package to access the Genius API, many characters are present that we wish to remove. We can use regular expressions to search for these characters, or groups of characters, and remove them from our data.
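As a toy sketch of this cleaning step (the raw string below is invented to mimic the shape of lyricsgenius output, including the bracketed section headers and the trailing ‘Embed’ footer), regular expressions strip each kind of noise in turn:

```python
import re

# Invented string mimicking the shape of a lyricsgenius result
raw = "Song Title Lyrics[Verse 1]\nHello, world (hello)\nGoodbye now123Embed"

text = re.sub(r'\[.*?\]', '', raw)       # drop section headers like [Verse 1]
text = re.sub(r'\(.*?\)', '', text)      # drop parenthesized backing vocals
text = text[text.find('Lyrics') + 6:]    # keep everything after the 'Lyrics' marker
text = re.sub(r'\n', ' ', text)          # newlines -> spaces
text = re.sub(r'.{3}Embed', '', text)    # drop the trailing '...Embed' footer
text = re.sub(r'[^a-zA-Z]+', ' ', text).strip()  # letters only

print(text)  # Hello world Goodbye now
```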
Our next step is to tokenize our data. In NLP, tokenizing is the process of turning unstructured data into discrete, usable units for natural language processing. There are various forms of tokenizing, but we will use word tokenization for our problem. This splits up our data, in this case the lyrics to each song, into a separate unit for each word in the lyrics. This unit, the word, is what the natural language processing will analyze (as opposed to using words plus common phrases, or advanced modeling techniques like transformers that consider the ordered sequence of words). Another common tokenizer is the sentence tokenizer, which splits up the data by sentence.
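To make the distinction concrete without pulling in nltk, here is a minimal sketch of word versus sentence tokenization using plain regular expressions (nltk's word_tokenize and sent_tokenize handle punctuation and edge cases far more carefully; the lyric string is made up):

```python
import re

lyric = "Take me home, country roads. To the place I belong."

# Word tokenization: one unit per word
word_tokens = re.findall(r"[A-Za-z']+", lyric)

# Sentence tokenization: naive split after sentence-final punctuation
sent_tokens = [s for s in re.split(r'(?<=[.!?])\s+', lyric) if s]

print(word_tokens[:4])   # ['Take', 'me', 'home', 'country']
print(len(sent_tokens))  # 2
```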
Once our data is tokenized, we can remove stop words. Stop words are common words that are not useful in NLP, e.g. ‘the’, ‘and’, or ‘an’. We remove these words before processing because they would be of little value in helping us classify our songs, and could even significantly harm our analysis by diluting the relevant words in the data. We use the nltk library’s default stopwords, plus a few more that are common in song lyrics. Additionally, we have included explicit words as stopwords, as we would rather not use them to classify our genres.
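A minimal sketch of the filtering step, with a tiny hand-rolled stopword set standing in for nltk's stopwords.words('english'):

```python
# Tiny stand-in for nltk's English stopword list
stop_words = {'the', 'and', 'an', 'a', 'i', 'to', 'me'}

tokens = ['Take', 'me', 'home', 'country', 'roads',
          'to', 'the', 'place', 'I', 'belong']

# Lowercase each token before checking membership
filtered = [t.lower() for t in tokens if t.lower() not in stop_words]

print(filtered)  # ['take', 'home', 'country', 'roads', 'place', 'belong']
```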
Finally, we must normalize our text in some way. The two most common forms of text normalization are stemming and lemmatization. Stemming is a heuristic-based approach that removes common word endings, leaving just the “stems”, e.g. boats, boating -> boat. Lemmatizing, on the other hand, applies a morphological analysis of each word to determine its base, or “lemma”, e.g. mice -> mouse. Lemmatization is typically preferred over stemming, as it provides a deeper analysis of the true meaning of each word, so we will use it in our analysis.
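To make the contrast concrete, here is a toy sketch: a crude suffix-stripping “stemmer” next to a lookup-table “lemmatizer”. The suffix list and lemma table are invented for illustration; the analysis in this tutorial uses nltk's SnowballStemmer and WordNetLemmatizer.

```python
def crude_stem(word):
    # Heuristic suffix stripping, loosely in the spirit of stemming
    for suffix in ('ness', 'ing', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Lemmatization needs morphological knowledge; a lookup table stands in here
lemma_table = {'mice': 'mouse', 'ran': 'run', 'better': 'good'}

def crude_lemma(word):
    return lemma_table.get(word, word)

print([crude_stem(w) for w in ['boats', 'boating', 'boatness']])  # ['boat', 'boat', 'boat']
print(crude_lemma('mice'))  # mouse
```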
import re # python regular expressions
import copy
import contractions
import nltk
#nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
def process_text(df):
# Performs various text pre-processing steps
# Input: Data frame with a column named 'lyrics'
# Output: Same dataframe but with pre-processed text
df1 = copy.deepcopy(df)
lyrics = df1.lyrics
lyrics_final = list()
snow_stemmer = SnowballStemmer(language='english')
wnl = WordNetLemmatizer()
for lyric in lyrics:
# Removes brackets and text inside
song_lyrics = re.sub(r'\[.*?\]', '', lyric)
# Removes parentheses and text inside
song_lyrics = re.sub(r'\(.*?\)', '',song_lyrics)
# Finds start of lyrics
song_lyrics = song_lyrics[song_lyrics.find('Lyrics')+6:]
# Removes newline chars (\n)
song_lyrics = re.sub("\n"," ",song_lyrics)
# (Apostrophes are kept for now so contractions can be expanded below)
# Removes the 'Embed' footer (and 3 preceding chars) at the end of the doc
song_lyrics = re.sub(".{3}Embed", "",song_lyrics)
# Expands contractions to full form
song_lyrics = contractions.fix(song_lyrics)
# Removes punctuation
song_lyrics = re.sub(r'[^\w\s]','',song_lyrics)
# Removes numbers and any other non-letter characters
song_lyrics = re.sub("[^a-zA-Z]+", " ",song_lyrics)
# Tokenize words
word_tokens = word_tokenize(song_lyrics)
# Lemmatize words
lemma_words_tokens = [wnl.lemmatize(token) for token in word_tokens]
# stopwords
stop_words = stopwords.words('english')
sw = ['ayy', 'like', 'come', 'yeah', 'got', 'la', 'ya',
'oh', 'ooh', 'huh', 'whooaaaaa', 'o', 'n', 'x']
explicit_words = ['nigga', 'nigger', 'bitch', 'bitchin', 'fag', 'faggot',
'fuck', 'fucked', 'fuckin', 'motherfucker', 'motherfuckin',
'pussy', 'dick', 'cock', 'whore','shit', 'shittin']
stop_words_final = stop_words + sw + explicit_words
# Remove stopwords
filtered_lyrics = [token.lower() for token in lemma_words_tokens if
token.lower() not in stop_words_final]
# Join lyrics into one string
lyrics_joined = ' '.join(filtered_lyrics).lower()
lyrics_final.append(lyrics_joined)
df1 = df1.drop(['lyrics'], axis=1)
df1['lyrics'] = lyrics_final
return df1
Apply pre-processing to datasets
cleaned_full_df = process_text(full_df)
cleaned_rap_df = process_text(rap_df)
cleaned_rock_df = process_text(rock_df)
cleaned_pop_df = process_text(pop_df)
cleaned_rb_df = process_text(rb_df)
cleaned_country_df = process_text(country_df)
pd.set_option('display.max_columns', None)
print(cleaned_full_df.head())
## title artist song_ids Unnamed: 4 \
## 0 Rap God Eminem 235729 NaN
## 1 WAP Cardi B 5832126 NaN
## 2 HUMBLE. Kendrick Lamar 3039923 NaN
## 3 Bad and Boujee Migos 2845980 NaN
## 4 SICKO MODE Travis Scott 3876994 NaN
##
## lyrics
## 0 look wa going go easy hurt feeling going get o...
## 1 whores house house house house said certified ...
## 2 nobody pray day way remember syrup sandwich cr...
## 3 know young rich know something really never ol...
## 4 astro sun freezin cold already know winter daw...
Now that we have finished the ‘setup’ part, we can finally get into the fun stuff.
Natural language processing seeks to translate human language, like text or speech, into comprehensible and analyzable pieces for learning machines. NLP has common applications such as speech recognition, topic extraction, named-entity recognition, and sentiment analysis. The abundance of readily available text data has made natural language processing a growing field.
We will take a look at two main uses for natural language processing: sentiment analysis and topic modeling. Sentiment analysis is used to determine the emotional sentiment around text, and is typically applied to reviews. Topic modeling is an unsupervised machine learning technique aimed at classifying documents into topics based on the words within each document. Both techniques apply to our lyrics data: we will be able to identify the sentiment of songs, as well as find possible topics among them.
Sentiment analysis is a multinomial text classification in which the emotional weight of the text (positive, neutral, negative) is calculated using natural language processing. Sentiment analysis has many applications, especially in analyzing reviews, surveys, and media. There are two major types of sentiment analysis: rule-based and embedding-based.
Rule-based analysis is the simpler approach: it does not leverage machine learning, and bases its calculations on known lexicons of words. This means rule-based sentiment analysis can flag songs as negative when they use a common word like “sad”, but terms unfamiliar to the lexicon are simply ignored and cannot be scored. It is also unable to understand the context in which words are used, meaning that homonyms (often pop-culture homonyms, in the context of song lyrics) can only be interpreted one way with this approach.
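A minimal sketch of the rule-based idea, with an invented six-word lexicon (real rule-based tools such as VADER ship large, curated lexicons with valence scores):

```python
# Invented mini-lexicon: word -> sentiment weight
lexicon = {'love': 1.0, 'great': 0.8, 'happy': 0.9,
           'sad': -0.9, 'hate': -1.0, 'lonely': -0.7}

def rule_based_sentiment(text):
    # Sum the scores of known words; unfamiliar words contribute nothing
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())

print(round(rule_based_sentiment('i love this sad lonely song'), 2))  # -0.6
```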
Embedding-based sentiment analysis, on the other hand, forms vector representations of words, where similar words are dimensionally similar. These vector representations can also be combined arithmetically to represent word combinations, e.g. king - man + woman ≈ queen. More info: https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair
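A toy sketch of that vector arithmetic, using hand-made 2-dimensional vectors (real embeddings such as word2vec or GloVe are learned from data and have hundreds of dimensions):

```python
import numpy as np

# Hand-assigned toy vectors: dimension 0 ~ 'royalty', dimension 1 ~ 'male'
vecs = {
    'king':  np.array([0.9, 0.9]),
    'man':   np.array([0.1, 0.9]),
    'woman': np.array([0.1, 0.1]),
    'queen': np.array([0.9, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land nearest to queen
target = vecs['king'] - vecs['man'] + vecs['woman']
best = max(['queen', 'man', 'woman'], key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```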
For our analysis, we will be using the Flair library and its sentiment-analysis model pre-trained on an IMDB dataset. This library is especially advanced because of the type of embedding it uses: Flair uses contextual string embeddings to determine the sentiment of words. It treats words as sequences of characters, and uses character-level language models, along with the embeddings of the surrounding text, to determine each word’s embedding. In practice, this means a word can be given different embeddings, or meanings, depending on context; in our example, a word can read as positive or negative depending on the surrounding lyrics.
It should be noted that this sentiment model is trained on an IMDB dataset. For that reason, it may not produce the most accurate analysis of our words. To take this method one step further - if you have a big enough dataset - you can train your own sentiment model with this package, specialized to your problem.
Copied from the flair github repo: https://github.com/flairNLP/flair
A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, with support for a rapidly growing number of languages.
A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including our proposed Flair embeddings, BERT embeddings and ELMo embeddings.
A PyTorch NLP framework. Our framework builds directly on PyTorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.
from flair.models import TextClassifier
from flair.data import Sentence
# This is the pre-built model and will take a while to download
classifier = TextClassifier.load('en-sentiment')
## 2022-12-14 14:39:49,610 loading file /Users/Brennan/.flair/models/sentiment-en-mix-distillbert_4.pt
def sentiment(df):
# Performs sentiment analysis on a text data
# Input: dataframe with a column named 'lyrics' for text data
# Output:
#  - original dataframe with sentiment scores of individual songs (dataframe)
#  - summed sentiment score for the dataset (float)
return_df = df.copy()  # copy so the input dataframe is not modified
song_scores = []
lyrics = df.lyrics.tolist()
# Sums each song's sentiment score to get a dataset-level total
sum_ = 0
for lyric in lyrics:
sentence = Sentence(lyric)
classifier.predict(sentence)
text = str(sentence.labels)
song_scores.append(text.split('/')[1])
score = text.split('/')[1]
num = float(score.split('(')[1].split(')')[0])
if "NEGATIVE" in score:
num = num * -1
sum_ += num
return_df['Sentiment_score'] = pd.Series(song_scores)
return return_df, sum_
sent_df_rap, avg_scr_rap = sentiment(cleaned_rap_df)
print('rap genre: ', round(avg_scr_rap, 4), '\n\n' 'Song scores: ', sent_df_rap.head(10))
## rap genre: -49.7011
##
## Song scores: title artist song_ids \
## 0 Rap God Eminem 235729
## 1 WAP Cardi B 5832126
## 2 HUMBLE. Kendrick Lamar 3039923
## 3 Bad and Boujee Migos 2845980
## 4 SICKO MODE Travis Scott 3876994
## 5 God’s Plan Drake 3315890
## 6 Man’s Not Hot Big Shaq 3244990
## 7 XO TOUR Llif3 Lil Uzi Vert 3003630
## 8 1-800-273-8255 Logic 3050777
## 9 Bodak Yellow Cardi B 3095483
##
## lyrics Sentiment_score
## 0 look wa going go easy hurt feeling going get o... 'NEGATIVE' (0.9792)]
## 1 whores house house house house said certified ... 'POSITIVE' (0.9937)]
## 2 nobody pray day way remember syrup sandwich cr... 'NEGATIVE' (0.9863)]
## 3 know young rich know something really never ol... 'NEGATIVE' (0.7764)]
## 4 astro sun freezin cold already know winter daw... 'POSITIVE' (0.8489)]
## 5 wishin wishin wishin wishin wishin movin calm ... 'NEGATIVE' (0.8825)]
## 6 yo big shaq one mans hot never hot skrrat skid... 'POSITIVE' (0.9182)]
## 7 alright alright quite alright money right coun... 'NEGATIVE' (0.992)]
## 8 low taking time feel mind feel life mine low t... 'NEGATIVE' (0.9818)]
## 9 ksr cardi said wanted dance said lil wanted ex... 'NEGATIVE' (0.5963)]
Only showing one output
sent_df_rock, avg_scr_rock = sentiment(cleaned_rock_df)
print('rock genre: ', round(avg_scr_rock, 4), '\n\n' 'Song scores: ', sent_df_rock.head(10))
sent_df_pop, avg_scr_pop = sentiment(cleaned_pop_df)
print('pop genre: ', round(avg_scr_pop, 4), '\n\n' 'Song scores: ', sent_df_pop.head(10))
sent_df_rb, avg_scr_rb = sentiment(cleaned_rb_df)
print('rb genre: ', round(avg_scr_rb, 4), '\n\n' 'Song scores: ', sent_df_rb.head(10))
sent_df_country, avg_scr_country = sentiment(cleaned_country_df)
print('country genre: ', round(avg_scr_country, 4), '\n\n' 'Song scores: ', sent_df_country.head(10))
Latent Dirichlet Allocation (LDA) is an approach to topic modeling. The goal of LDA is to discover hidden, or latent, topics within a set of text documents. In the context of our example, documents are songs and the text is their lyrics. Assume we have k latent topics we hope to discover. In LDA, each document can be viewed as a k-nomial distribution: the distribution over the k latent topics gives the probability of that document belonging to each topic. Each of the k latent topics, in turn, can be viewed as a distribution over words. These two distributions are estimated through an iterative process that groups suitable words and documents together, creating topics characterized by words, and assigning documents suitable topics based on those words.
LDA starts by assigning random topics to each word in each document. Then, the algorithm selects one word to update its topic classification. With the aforementioned distributions, the algorithm calculates the probability of each topic given the document (found by taking the counts of all the topics for all the other words in the document) and the probability of each word for each topic (found by taking the counts of each word with each topic across all documents) and multiplies them to find the probability that each topic generated that word. We then pick the most likely topic and assign the word that new topic. This process is repeated for all words in all documents, and then iterated over to reach a steady state of latent topics.
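A toy sketch of one such word update, with invented counts for a two-topic model (real implementations like gensim's LdaModel maintain these counts over the whole corpus and iterate many times):

```python
# Invented state: counts excluding the word currently being updated
K = 2                                  # number of latent topics
doc_topic = [5, 1]                     # topic counts for the other words in this document
word_topic = {'love': [10, 2],         # corpus-wide counts of each word under each topic
              'money': [1, 12]}
topic_total = [30, 40]                 # total words currently assigned to each topic
alpha, eta, V = 0.1, 0.01, 2           # Dirichlet priors; V = vocabulary size

def topic_probs(word):
    # p(topic | document) * p(word | topic), then normalize
    weights = [(doc_topic[k] + alpha) *
               (word_topic[word][k] + eta) / (topic_total[k] + V * eta)
               for k in range(K)]
    z = sum(weights)
    return [w / z for w in weights]

probs = topic_probs('love')
new_topic = probs.index(max(probs))  # reassign 'love' to its most likely topic
print(new_topic)  # 0
```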
Through each iteration, topic classifications are made based upon how well a word fits a topic, and how well that topic fits the document. Because initial assignments are random, our topic distributions will do a poor job at assigning words new topics, but eventually suitable words and topics will be paired together through the topic assignments. This will create a topic classification distribution for every document, as well as generating sets of common words for topics.
Below is a graphical model representing the Latent Dirichlet Allocation we are performing. Without getting into all of the details, you can see which variable is which and how each part is estimated.
Θ: topic mix for each document
Z: topic assignment for each word
W: word in each document
β: distribution of words for each topic
N: number of words
M: number of documents
𝜶: distribution of topics in documents
𝜂: distribution of words in topics
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#12buildingthetopicmodel
Note: For this tutorial we use datasets with 118-198 observations each, combined into a full dataset with 896 observations. However, “You should use at least 1,000 documents in each topic modeling job. Each document should be at least 3 sentences long. If a document consists of mostly numeric data, you should remove it from the corpus.” - From https://docs.aws.amazon.com/comprehend/latest/dg/topic-modeling.html
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
LDA coherence: https://rare-technologies.com/what-is-topic-coherence/
def LDA_topics(df = cleaned_full_df,n_topics=10,top_n_words=15):
# Split song lyrics into individual strings
sep_lyrics_list = []
for song in df.lyrics:
sep_lyrics = song.split()
sep_lyrics_list.append(sep_lyrics)
# Bigram model -- two words frequently occurring together in a song
bigram_init = gensim.models.Phrases(sep_lyrics_list)
bigram_model = gensim.models.phrases.Phraser(bigram_init)
lyrics_bigrams = [bigram_model[lyric] for lyric in sep_lyrics_list]
# Create corpus dictionary
id2word = corpora.Dictionary(lyrics_bigrams)
# Term frequency
corpus = [id2word.doc2bow(bigram) for bigram in lyrics_bigrams]
# LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=n_topics,
random_state=1,
update_every=1,
chunksize=100,
passes=20,
alpha='auto',
per_word_topics=True)
# Perplexity: measure of model performance (the lower the value the better the performance)
print('Perplexity: ', lda_model.log_perplexity(corpus))
coherence_model_lda = CoherenceModel(model=lda_model,
texts=lyrics_bigrams,
dictionary=id2word,
coherence='c_v')
# Coherence: measure of interpretability
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)
topic_nums = []
words = []
# Create list of topics and nested list of words for data frame
for index, topic in lda_model.show_topics(formatted=False,
num_words=top_n_words,
num_topics=n_topics):
topic_nums.append(index)
words.append([word[0] for word in topic])
# Initial data frame -- formatting needed
init_df = pd.DataFrame({'Topic':topic_nums,
'Words': words})
# Split "Word" column into len(top_n_words) columns
split_words_df = pd.DataFrame(init_df['Words'].to_list(),
columns=['Word ' + str(i) for i in range(top_n_words)])
split_words_df['Topic'] = topic_nums
# Reorder the columns (Topic first)
cols = split_words_df.columns.tolist()
cols = cols[-1:] + cols[:-1]
# Final LDA column
LDA_df = split_words_df[cols]
return LDA_df
LDA_topics(cleaned_full_df)
## Perplexity: -8.163395244557206
## Coherence Score: 0.35423024601321873
## Topic Word 0 Word 1 Word 2 Word 3 \
## 0 0 wa remember_well night keep
## 1 1 run_away go let look_around
## 2 2 know love want get
## 3 3 crazy inside_head switchin_side nothing
## 4 4 know gon get girl
## 5 5 bam_bam ey_ey girl el
## 6 6 want baby love run
## 7 7 tell way going good
## 8 8 dokie dual chuckie cigarettes_cigarette
## 9 9 young_dumb dumb_broke badeeya young_young
##
## Word 4 Word 5 Word 6 Word 7 Word 8 \
## 0 day new_york back feelin_feelin queen
## 1 na_na hey_hey going work work_work
## 2 never say go would wa
## 3 head well think_crazy crazy_kind whoo
## 4 wild_wild one hol_hol going good
## 5 bam_ey ey_bam bam_dilla know_hotline bling_mean
## 6 money get let_u go nah
## 7 see life thing nothing one
## 8 cocoa_butter cointel colin coyote craziest
## 9 kill_vibe vibe_kill baduda badeeya_deeya school_kid
##
## Word 9 Word 10 Word 11 \
## 0 never know low
## 1 daddy ever daddy_said
## 2 baby time one
## 3 stop_holdin dyin wanted
## 4 feel back god
## 5 que tu de
## 6 gon girl take
## 7 better_man low lost
## 8 dank_miss deadbeat delegated
## 9 broke_high yadadadadadadada_yadadadadadada yadadadadadadada_young
##
## Word 12 Word 13 Word 14
## 0 wa_rare ring_fire remember
## 1 uhoh try always_stay
## 2 see let feel
## 3 hope_die oohooh tryin_save
## 4 careful hit drippin_finesse
## 5 mi country_road bo
## 6 murder_mind boy loyalty_loyalty
## 7 back take wish
## 8 chlorophyll dougie huggy
## 9 remain deeya badu
LDA_topics(cleaned_rap_df)
## Perplexity: -8.108736558304567
## Coherence Score: 0.28773929568832246
## Topic Word 0 Word 1 Word 2 Word 3 Word 4 \
## 0 0 el monster slay slay_slay que
## 1 1 know want need get girl
## 2 2 know love back wa boy
## 3 3 bish enemy_lot bought oohoohoohoohooh way
## 4 4 know want get feel need
## 5 5 want wa get know love
## 6 6 inside_dna witchu give_give fine dna
## 7 7 never go get make still
## 8 8 get versace_versace hit gucci_gang go
## 9 9 know get back man see
##
## Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 \
## 0 see_hand medusa lifestyle said beginnin top
## 1 real say never love see make
## 2 get let let_go still girl one
## 3 strong ohohohoh front_gun see wave drain
## 4 time go love would make life
## 5 would give back go say time
## 6 remy_boyz anything_give lil_stupid yeaaah rot rewind
## 7 way put give take cocoa_butter back
## 8 want new_york put uh back gon_alright
## 9 want go one wa girl tell
##
## Word 11 Word 12 Word 13 Word 14
## 0 baby thugga de take
## 1 money go bad tell
## 2 made think right thought
## 3 standing duckworth dollar_might nah_dollar
## 4 one right take say
## 5 life night make never
## 6 loyalty sewed monty wit
## 7 keep say til could
## 8 know make man right
## 9 going money never gon
LDA_topics(cleaned_rock_df)
## Perplexity: -7.428911702453029
## Coherence Score: 0.3866658150907866
## Topic Word 0 Word 1 Word 2 Word 3 \
## 0 0 know wa would love
## 1 1 ever_wa go letting_day water
## 2 2 make high want day_die
## 3 3 na_na say well get_high
## 4 4 get know go never
## 5 5 go run_away need_run let
## 6 6 want go see never
## 7 7 save_heavydirtysoul bennie_bennie know watermelon_sugar
## 8 8 one say would let
## 9 9 going get every walk_alone
##
## Word 4 Word 5 Word 6 Word 7 Word 8 \
## 0 one see way let want
## 1 water_flowing time coming_coming away one
## 2 life know give friend let_vibe
## 3 baby good want feel take
## 4 see baby love another_one going_quit
## 5 take_money long could know_daddy make
## 6 mine take love get u
## 7 save_save want bennie high_watermelon go
## 8 make thing hey_jude might_also love
## 9 know getting_dizzy natural walk wa
##
## Word 9 Word 10 Word 11 Word 12 Word 13 Word 14
## 0 never time back get go life
## 1 far wa may shoot_shoot start_fire since_world
## 2 away nobody_drag back music said taken
## 3 know said tell guess look god
## 4 could hey bite_dust said ohoh make
## 5 girl thing take stay feel never
## 6 say purple_rain know always feel need
## 7 jets sugar_high jets_bennie therefore mad say
## 8 run shut_mouth better free_fallin feel thinkin_much
## 9 heart love never light full take_hand
LDA_topics(cleaned_pop_df)
## Perplexity: -7.4807874561033465
## Coherence Score: 0.3089222477529462
## Topic Word 0 Word 1 Word 2 Word 3 Word 4 \
## 0 0 know wa want love never
## 1 1 know make feelin_feelin going get
## 2 2 want something back go doo
## 3 3 get girl go baby going
## 4 4 know one baby love get
## 5 5 hol_hol ah love_sent day_christmas true
## 6 6 know want get would see
## 7 7 one want know want_alive time
## 8 8 love want night way often
## 9 9 love know baby want life
##
## Word 5 Word 6 Word 7 Word 8 \
## 0 get way good go
## 1 baby never que girl
## 2 doodoodoo_doodoodoo na_nana na ceiling_hold
## 3 take know want right
## 4 going back thing cry
## 5 partridge_pear look done_starboy turtle_dove
## 6 say tell make time
## 7 mind pickin_loving tear_left girl_bummer
## 8 time call_name keep run
## 9 baby_baby could feel see
##
## Word 9 Word 10 Word 11 Word 12 Word 13 \
## 0 see let baby would say
## 1 let hot back tu good
## 2 fight_til moment_tonight u_ceiling time_fore go_higher
## 3 back make see love body
## 4 way run_away let look_look would
## 5 three_french hen_two tree young thrill
## 6 one feel take could let
## 7 throw_tantrum hate_hot anthem_turn feel turnin
## 8 girl need might make_earth pipe
## 9 going mine knew young_dumb stay
##
## Word 14
## 0 could
## 1 read_read
## 2 power_taking
## 3 say
## 4 alright
## 5 need
## 6 go
## 7 livin_pickin
## 8 call
## 9 heart
LDA_topics(cleaned_rb_df)
## Perplexity: -7.319789081104805
## Coherence Score: 0.31427322069409414
## Topic Word 0 Word 1 Word 2 Word 3 Word 4 \
## 0 0 flawless feelin_feelin lie god_damn die
## 1 1 know baby time way want
## 2 2 often hol_hol mornin turn matter
## 3 3 thank_next one_two found look_around work_work
## 4 4 murder_mind wild_wild wild get wild_thought
## 5 5 baby want see feel know
## 6 6 love life back know one
## 7 7 know nothing feel baby go
## 8 8 love know want get girl
## 9 9 one know make say_name girl
##
## Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 \
## 0 look_sexy end go b said never
## 1 wa love let mind feel say
## 2 side make girl hello anymore want
## 3 worried_bout chandelier hey_hey greatest_city three turned
## 4 god king church_wild mob alright beat
## 5 halo heartless way tell halo_halo need
## 6 girl take time run_away go need
## 7 bring want love girl going one
## 8 say need wa never would baby
## 9 new_york let bam_bam life ey_ey run
##
## Word 11 Word 12 Word 13 Word 14
## 0 swear cross_heart take_element hope
## 1 girl right go one
## 2 say til time sorry
## 3 three_drink corny_wish doe tonight_holding
## 4 human white rockin coke
## 5 girl time life light
## 6 wa let could want
## 7 get make need really
## 8 thing see good feel
## 9 never live love made
LDA_topics(cleaned_country_df)
## Perplexity: -7.316227106159335
## Coherence Score: 0.3988602773158466
## Topic Word 0 Word 1 Word 2 Word 3 \
## 0 0 wa remember_well say_something god
## 1 1 red still nothing would
## 2 2 take wa love go
## 3 3 wish never wa road
## 4 4 better_man country_music miss_wish jolene_jolene
## 5 5 wild_horse could_drag away wild
## 6 6 know time never love
## 7 7 get_back wa love see
## 8 8 wa one texas big_iron
## 9 9 want know home get
##
## Word 4 Word 5 Word 6 Word 7 Word 8 \
## 0 wind_hair wa_rare remember still something
## 1 old_town ride_til take_horse road_going nothing_tell
## 2 get one would right name
## 3 every high home kid died
## 4 long_live hold know could still
## 5 ride ring_fire thing let meant_baby
## 6 say go away day turn
## 7 make last man back thing
## 8 town feleena men made ranger
## 9 treacherous time think follow_follow eye
##
## Word 9 Word 10 Word 11 Word 12 Word 13 Word 14
## 0 ana stair_wa maybe_looking maybe say nothing
## 1 nobody_tell taste_tequila ridin sky among youth
## 2 going left never know time back
## 3 grandpa back time lost cold cooler
## 4 please would say_goodbye become magic wa
## 5 clementine went might_also freedom pain lie
## 6 hold life much baby always_stay light
## 7 say go ohoh ever alone going
## 8 hip_big would daughter tried outlaw iron_hip
## 9 slope iii safe would alone say
Another approach to topic modeling is BERTopic, a transformer-based model that builds dense clusters of documents, yielding interpretable topics along with the words most important to each one. Transformers are convenient because they let us take models that have already been trained and fine-tune them on our own data. This matters because the datasets we have are often too small to train a full model from scratch, and we generally lack access to the powerful GPUs such training requires.
BERTopic works by first converting the text documents into numerical embeddings with a transformer; the embedded pre-trained model is then updated and fine-tuned with the data we supply. The sentence-transformers library handles this step.
After transforming the data with the pre-trained model, we cluster the documents so that those with similar topics end up together. Because clustering struggles in high dimensions, we first reduce the dimensionality, balancing between too few dimensions (lost information) and too many (poor clustering). Once the dimensionality is lowered, the clusters we form become the topics we are looking for. Popular choices are UMAP for dimensionality reduction and HDBSCAN for forming the clusters.
After forming these clusters, the next step is to figure out what each one represents by comparing the importance of words across the different clusters. A common approach is a class-based variant of TF-IDF (c-TF-IDF): instead of measuring a word's importance relative to its own document, we join each cluster's documents together and measure the word's importance within that combined cluster. Taking the top 20 or so scoring words per cluster gives a good idea of the topic we are looking at.
There are a couple of common issues people run into with BERTopic. Many transformers limit the size of the documents they accept, so we may have to split documents into paragraphs, or songs into verses. Another issue is ending up with too many clusters; to reduce the topic count we can increase min_cluster_size in HDBSCAN, leaving us with fewer, more meaningful topics.
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic import BERTopic
import random
import plotly.io as pio
We were able to implement this code in time, but be aware that BERTopic models take a very long time to fit.
data = cleaned_full_df.lyrics.to_list()
datasmall = data
#datasmall = random.sample(data, 200) # uncomment to fit on a 200-song sample instead
topic_model = BERTopic() # default model; topic count chosen automatically
model = BERTopic(nr_topics=20) # reduce the output to at most 20 topics
topics, probs = model.fit_transform(datasmall) # fit the model and assign each song a topic
# Inspect individual topics and their top-weighted words
model.get_topic(1)
## [('versace', 0.3466758867560014), ('comin', 0.2535267531969895), ('feel', 0.21283023797881748), ('babe', 0.1319302262124721), ('baby', 0.11843354472520784), ('dusk', 0.10451005625699201), ('dawn', 0.0972775882532528), ('till', 0.09494247304884539), ('girl', 0.06730301111162452), ('love', 0.06553494133301181)]
model.get_topic(2)
## [('get', 0.037234276978791815), ('know', 0.03353845096947296), ('want', 0.029842139035109034), ('one', 0.026727204347575355), ('go', 0.024929728120388194), ('back', 0.02324432085522547), ('see', 0.020235305205870167), ('love', 0.020062295595921306), ('let', 0.01957727594371185), ('girl', 0.01940712932636818)]
model.get_topic(3)
## [('love', 0.05332289698815026), ('know', 0.0513788165105225), ('wa', 0.04185401605081198), ('would', 0.039284996767431496), ('want', 0.0382747887939472), ('never', 0.03663786012062019), ('say', 0.031601318228263474), ('go', 0.03110966809236601), ('time', 0.03040700571206907), ('see', 0.030226845920717792)]
pio.show(model.visualize_topics()) # Inter-topic distance map
pio.show(model.visualize_barchart()) # visualize topics with top words
pio.show(model.visualize_heatmap()) # Visualize topic similarity as a heatmap
def album_songs(song_ids):
    # Takes a list of song ids and returns a df of every song on each input song's album
    albums = list()
    songs = list()
    artists = list()
    new_song_ids = list()
    for song in song_ids:
        t = True
        while t == True:
            try:
                # Search song API with parameters; retry on transient failures
                song_info = genius.song(song)
            except:
                pass
            else:
                t = False
        if song_info['song']['album'] != None:
            album_id = song_info['song']['album']['id']
            album_name = song_info['song']['album']['name']
            album_artist = song_info['song']['artist_names']
            t = True
            while t == True:
                try:
                    # Search album API with parameters; retry on transient failures
                    album_dict = genius.album_tracks(album_id)
                except:
                    pass
                else:
                    t = False
            len_album = len(album_dict['tracks'])
            for track in range(len_album):
                song_id = album_dict['tracks'][track]['song']['id']
                song_name = album_dict['tracks'][track]['song']['title']
                artists.append(album_artist)
                albums.append(album_name)
                songs.append(song_name)
                new_song_ids.append(song_id)
    df = pd.DataFrame({'song': songs, 'album': albums, 'artist': artists, 'song_ids': new_song_ids})
    return df
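The `while t == True` with a bare `except: pass` above will retry forever if the API keeps failing. A bounded retry helper (a sketch, not part of the original tutorial) is a safer pattern for the same job:

```python
import time

def retry_call(fn, *args, max_attempts=5, delay=1.0, **kwargs):
    """Retry a flaky call a bounded number of times instead of looping forever."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay)  # brief pause before retrying
```

With this helper, the retry loops above collapse to calls like `song_info = retry_call(genius.song, song)`, using the `genius` client set up earlier in the tutorial.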
def make_dataset(genre, time_period='all_time', n_per_page=20, pages=1):
    # Pull the top charts for the genre, expand to full albums, then attach album names
    df = top_charts(genre=genre, time_period=time_period, n_per_page=n_per_page, pages=pages)
    df = album_songs(df['song_ids'])
    myDict = {k: v for k, v in zip(df['song_ids'], df['album'])}
    df = song_info(df['song_ids'])
    new_albums = list()
    for i in df['song_ids']:
        for j in myDict.keys():
            if i == j:
                new_albums.append(myDict.get(j))
    df = pd.concat([df,
                    pd.Series(new_albums, name='album', dtype='object')], axis=1)
    csv_name = 'topcharts_' + time_period + genre
    df.to_csv(os.path.join(os.getcwd(), csv_name + '.csv'))
    return df
Flair implementation for text classification